Abstract



We present a data dependent generalization bound for a large class of 
regularized algorithms which implement structured sparsity constraints. 
The bound can be applied to standard squared-norm regularization, the 
Lasso, the group Lasso, some versions of the group Lasso with overlapping 
groups, multiple kernel learning and other regularization schemes. In all 
these cases competitive results are obtained. A novel feature of our bound 
is that it can be applied in an infinite dimensional setting such as the Lasso 
in a separable Hilbert space or multiple kernel learning with a countable 
number of kernels. 

1 Introduction

We study a class of regularization methods used to learn a linear function from
a finite set of examples. The regularizer is expressed as an infimum convolution
which involves a set $\mathcal{M}$ of linear transformations (see equation (1) below). As
we shall see, this regularizer generalizes, depending on the choice of the set $\mathcal{M}$,
the regularizers used by several learning algorithms, such as ridge regression,
the Lasso, the group Lasso [22], multiple kernel learning [10], the group Lasso
with overlap [5], and the regularizers in [16].

We give a bound on the Rademacher average of the linear function class
associated with this regularizer. The result matches existing bounds in the
above-mentioned cases but also admits a novel, dimension-free interpretation. In
particular, the bound applies to the Lasso in $\ell_2$ or to multiple kernel learning with
a countable number of kernels, under certain finite second-moment conditions.

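Let $H$ be a real Hilbert space with inner product $\langle\cdot,\cdot\rangle$ and induced norm
$\|\cdot\|$. Let $\mathcal{M}$ be a finite or countably infinite set of symmetric bounded linear
operators on $H$ such that for every $x \in H$, $x \neq 0$, there is some linear operator
$M \in \mathcal{M}$ with $Mx \neq 0$, and such that $\sup_{M\in\mathcal{M}} |||M||| < \infty$, where $|||\cdot|||$ is the
operator norm. Define the function $\|\cdot\|_{\mathcal{M}} : H \to \mathbb{R}_+ \cup \{\infty\}$ by

\[ \|\beta\|_{\mathcal{M}} = \inf\Big\{ \sum_{M\in\mathcal{M}} \|v_M\| \;:\; v_M \in H,\ \sum_{M\in\mathcal{M}} M v_M = \beta \Big\}. \tag{1} \]

It is shown in Section 3.2 that the chosen notation is justified, because $\|\cdot\|_{\mathcal{M}}$ is
indeed a norm on the subspace of $H$ where it is finite, and the dual norm is, for
every $z \in H$, given by

\[ \|z\|_{\mathcal{M}*} = \sup_{M\in\mathcal{M}} \|Mz\|. \]

The somewhat complicated definition of $\|\cdot\|_{\mathcal{M}}$ is contrasted by the simple form
of the dual norm.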
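
To make the contrast concrete, here is a minimal sketch for a finite set of symmetric matrices on a finite-dimensional space. The helper names and the two overlapping projections are illustrative assumptions, not part of the paper. It evaluates the dual norm $\sup_M \|Mz\|$ directly and uses one feasible decomposition $\beta = \sum_M M v_M$ to get an upper bound on the primal norm, which is enough to check the duality inequality $|\langle\beta,z\rangle| \le (\sum_M \|v_M\|)\,\sup_M\|Mz\|$.

```python
import math

def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

def dual_norm(operators, z):
    """sup over M of ||M z|| for a finite list of symmetric matrices."""
    return max(norm(matvec(M, z)) for M in operators)

def decomposition_cost(operators, vs):
    """Return (beta, cost) with beta = sum_M M v_M and cost = sum_M ||v_M||.

    The cost of any feasible decomposition upper-bounds the primal norm,
    since the primal norm is the infimum over all decompositions.
    """
    beta = [0.0] * len(vs[0])
    for M, v in zip(operators, vs):
        beta = [b + c for b, c in zip(beta, matvec(M, v))]
    return beta, sum(norm(v) for v in vs)

# Hypothetical example: two overlapping coordinate projections on R^3.
P12 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
P23 = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
ops = [P12, P23]
beta, cost = decomposition_cost(ops, [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
z = [1.0, -2.0, 0.5]
inner = sum(b * c for b, c in zip(beta, z))
```

Computing the primal norm exactly would require solving the infimum in (1), whereas the dual norm is a plain maximum over the operators.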



Using well known techniques, as described in [9] and [2], our study of
generalization reduces to the search for a good bound on the empirical Rademacher
complexity of a set of linear functionals with $\|\cdot\|_{\mathcal{M}}$-bounded weight vectors

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) = \frac{2}{n}\,\mathbb{E}\sup_{\beta:\|\beta\|_{\mathcal{M}}\le 1}\ \sum_{i=1}^n \epsilon_i \langle\beta, x_i\rangle, \tag{2} \]

where $\mathbf{x} = (x_1,\dots,x_n) \in H^n$ is a sample vector representing observations, and
$\epsilon_1,\dots,\epsilon_n$ are Rademacher variables, mutually independent and each uniformly
distributed on $\{-1,1\}$. Given a bound on $\mathcal{R}_{\mathcal{M}}(\mathbf{x})$ we obtain uniform bounds
on the estimation error, for example using the following standard result (adapted
from [5]), where the Lipschitz function $\phi$ is to be interpreted as a loss function.

Theorem 1 Let $\mathbf{X} = (X_1,\dots,X_n)$ be a vector of iid random variables with
values in $H$, let $X$ be iid to $X_1$, let $\phi : \mathbb{R} \to [0,1]$ have Lipschitz constant $L$ and
$\delta \in (0,1)$. Then with probability at least $1-\delta$ in the draw of $\mathbf{X}$ it holds, for
every $\beta \in \mathbb{R}^d$ with $\|\beta\|_{\mathcal{M}} \le 1$, that

\[ \mathbb{E}\,\phi(\langle\beta,X\rangle) \le \frac{1}{n}\sum_{i=1}^n \phi(\langle\beta,X_i\rangle) + L\,\mathcal{R}_{\mathcal{M}}(\mathbf{X}) + \sqrt{\frac{9\ln(2/\delta)}{2n}}. \]

A similar (slightly better) bound is obtained if $\mathcal{R}_{\mathcal{M}}(\mathbf{X})$ is replaced by its
expectation $\mathcal{R}_{\mathcal{M}} = \mathbb{E}\,\mathcal{R}_{\mathcal{M}}(\mathbf{X})$ (see [2]).

¹ Our definition of the empirical Rademacher complexity coincides with the one in [2], while other authors omit the factor of 2. This
is relevant when comparing the constants in different bounds.

The following is the main result of this paper and leads to consistency proofs
and finite sample generalization guarantees for all algorithms which use a
regularizer of the form (1). A proof is given in Section 3.3.

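Theorem 2 Let $\mathbf{x} = (x_1,\dots,x_n) \in H^n$ and $\mathcal{R}_{\mathcal{M}}(\mathbf{x})$ be defined as in (2). Then

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) \le \frac{2^{3/2}}{n}\sqrt{\sup_{M\in\mathcal{M}}\sum_{i=1}^n \|Mx_i\|^2}\,\left(2+\sqrt{\ln\frac{\sum_{M\in\mathcal{M}}\sum_{i=1}^n\|Mx_i\|^2}{\sup_{N\in\mathcal{M}}\sum_{j=1}^n\|Nx_j\|^2}}\right) \]
\[ \le \frac{2^{3/2}}{n}\sqrt{\sum_{i=1}^n \|x_i\|_{\mathcal{M}*}^2}\,\left(2+\sqrt{\ln|\mathcal{M}|}\right). \]

The second inequality follows from the first, the inequality

\[ \sup_{M\in\mathcal{M}}\sum_{i=1}^n \|Mx_i\|^2 \le \sum_{i=1}^n \|x_i\|_{\mathcal{M}*}^2 \]

(a fact which will be tacitly used in the sequel) and the observation that every
summand in the logarithm appearing in the first inequality is bounded by 1.
Of course the second inequality is relevant only if $\mathcal{M}$ is finite. In this case we
can draw the following conclusion: If we have an a priori bound on $\|X\|_{\mathcal{M}*}$ for
some data distribution, say $\|X\|_{\mathcal{M}*} \le C$, and $\mathbf{X} = (X_1,\dots,X_n)$, with $X_i$ iid to
$X$, then

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{X}) \le \frac{2^{3/2}C}{\sqrt{n}}\left(2+\sqrt{\ln|\mathcal{M}|}\right), \]

thus passing from a data-dependent to a distribution-dependent bound. In
Section 2 we show that this recovers existing results for many regularization
schemes.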
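
As an illustration, both bounds of Theorem 2 can be evaluated for a finite operator set. This is a sketch under stated assumptions: the operators are given as explicit matrices, the data are finite-dimensional, and the function and variable names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(c * c for c in v)

def theorem2_bounds(operators, sample):
    """Return (first, second) bounds of Theorem 2 for a finite operator list.

    Assumes at least one operator does not annihilate the whole sample.
    """
    n = len(sample)
    # energy[m] = sum_i ||M x_i||^2 for the m-th operator
    energy = [sum(sq_norm(matvec(M, x)) for x in sample) for M in operators]
    top = max(energy)
    first = (2 ** 1.5 / n) * math.sqrt(top) * (
        2 + math.sqrt(math.log(sum(energy) / top)))
    # sum_i ||x_i||_{M*}^2 = sum_i max_M ||M x_i||^2
    dual_energy = sum(max(sq_norm(matvec(M, x)) for M in operators)
                      for x in sample)
    second = (2 ** 1.5 / n) * math.sqrt(dual_energy) * (
        2 + math.sqrt(math.log(len(operators))))
    return first, second

# Hypothetical example: coordinate projections on R^3, data on one axis only.
projections = [
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 1]],
]
first, second = theorem2_bounds(projections, [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

In this example the data excite a single operator, so the logarithm in the first bound vanishes while the second bound still pays $\sqrt{\ln 3}$.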

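But the first bound in Theorem 2 can be considerably smaller than the second
and may be finite even if $\mathcal{M}$ is infinite. This gives rise to some apparently novel
features, even in the well studied case of the Lasso, when there is a (finite but
potentially large) $\ell_2$-bound on the data.

Corollary 3 Under the conditions of Theorem 2 we have

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) \le \frac{2^{3/2}}{n}\sqrt{\sup_{M\in\mathcal{M}}\sum_{i=1}^n\|Mx_i\|^2}\,\left(2+\sqrt{\ln\Big(\frac{1}{n}\sum_{M\in\mathcal{M}}\sum_{i=1}^n\|Mx_i\|^2\Big)}\right) + \frac{2}{\sqrt{n}}. \]

A proof is given in Section 3.3. To obtain a novel distribution-dependent
bound we retain the condition $\|X\|_{\mathcal{M}*} \le C$ and replace finiteness of $\mathcal{M}$ by the
condition that

\[ R^2 := \mathbb{E}\sum_{M\in\mathcal{M}} \|MX\|^2 < \infty. \tag{3} \]

Taking the expectation in Corollary 3 then gives a bound on the expected
Rademacher complexity

\[ \mathcal{R}_{\mathcal{M}} \le \frac{2^{3/2}C}{\sqrt{n}}\left(2+\sqrt{\ln R^2}\right) + \frac{2}{\sqrt{n}}. \tag{4} \]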
The key features of this result are the dimension-independence and the only
logarithmic dependence on $R^2$, which in many applications turns out to be
simply $R^2 = \mathbb{E}\|X\|^2$.
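
The logarithmic dependence can be seen numerically. The sketch below evaluates the distribution-dependent bound in the form $\frac{2^{3/2}C}{\sqrt{n}}(2+\sqrt{\ln R^2}) + \frac{2}{\sqrt{n}}$; the function name and the sample values of $C$, $R^2$ and $n$ are illustrative assumptions.

```python
import math

def expected_complexity_bound(C, R2, n):
    """Evaluate (2^{3/2} C / sqrt(n)) * (2 + sqrt(ln R2)) + 2 / sqrt(n).

    C  : almost-sure bound on the dual norm of a data point
    R2 : second-moment quantity R^2 (assumed >= 1 so the log is nonnegative)
    n  : sample size
    """
    if R2 < 1:
        raise ValueError("this sketch assumes R2 >= 1")
    return (2 ** 1.5 * C / math.sqrt(n)) * (2 + math.sqrt(math.log(R2))) \
        + 2 / math.sqrt(n)

# Raising R^2 by six orders of magnitude changes the bound only moderately.
small = expected_complexity_bound(1.0, 1.0, 100)
large = expected_complexity_bound(1.0, 1e6, 100)
```

With these values the bound grows by a factor of less than three while $R^2$ grows by a factor of a million.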

The rest of the paper is organized as follows. In the next section we specialize
our results to different regularizers. In Section 3 we present the proof of Theorem
2 as well as the proof of other results mentioned above. In Section 4 we discuss
the extension of these results to the $\ell_q$ case. Finally we draw our conclusions
and comment on future work.



2 Examples 

Before giving the examples we mention a great simplification in the definition of
the norm $\|\cdot\|_{\mathcal{M}}$ which occurs when the members of $\mathcal{M}$ have mutually orthogonal
ranges. A simple argument, given in Proposition 9 below, shows that in this case

\[ \|\beta\|_{\mathcal{M}} = \sum_{M\in\mathcal{M}} \|M^+\beta\|, \]

where $M^+$ is the pseudoinverse of $M$. If, in addition, every member of $\mathcal{M}$ is an
orthogonal projection $P$, the norm further simplifies to

\[ \|\beta\|_{\mathcal{M}} = \sum_{P\in\mathcal{M}} \|P\beta\| \]

and the quantity occurring in the second moment condition (3) simplifies to

\[ R^2 = \mathbb{E}\sum_{P\in\mathcal{M}} \|PX\|^2 = \mathbb{E}\|X\|^2. \]

For the remainder of this section $\mathbf{X} = (X_1,\dots,X_n)$ will be a generic iid
random vector of data points, $X_i \in H$, and $X$ will be a generic data variable,
iid to $X_i$. If $H = \mathbb{R}^d$ we write $(X)_k$ for the $k$-th coordinate of $X$, not to be
confused with $X_k$, which would be the $k$-th member of the vector $\mathbf{X}$.

2.1 The Euclidean Regularizer

In this simplest case we set $\mathcal{M} = \{I\}$, where $I$ is the identity operator on the
Hilbert space $H$. Then $\|\beta\|_{\mathcal{M}} = \|\beta\|$, $\|z\|_{\mathcal{M}*} = \|z\|$, and the bound on the
empirical Rademacher complexity becomes

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) \le \frac{2^{5/2}}{n}\sqrt{\sum_{i=1}^n \|x_i\|^2}, \]

worse by a constant factor of $2^{3/2}$ than the corresponding result in the literature, a tribute
paid to the generality of our result.
2.2 The Lasso

Let us first assume that $H = \mathbb{R}^d$ is finite dimensional and set $\mathcal{M} = \{P_1,\dots,P_d\}$
where $P_k$ is the orthogonal projection onto the 1-dimensional subspace generated
by the basis vector $e_k$. All the above mentioned simplifications apply and we
have $\|\beta\|_{\mathcal{M}} = \|\beta\|_1$ and $\|z\|_{\mathcal{M}*} = \|z\|_\infty$. Writing $x_{ik}$ for the $k$-th coordinate of
a data-point $x_i$, so that $\|x_i\|_\infty = \max_k |x_{ik}|$, the bound on $\mathcal{R}_{\mathcal{M}}(\mathbf{x})$ now reads

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) \le \frac{2^{3/2}}{n}\sqrt{\sum_{i=1}^n \|x_i\|_\infty^2}\,\left(2+\sqrt{\ln d}\right). \]

If $\|X\|_\infty \le 1$ almost surely we obtain

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{X}) \le \frac{2^{3/2}}{\sqrt{n}}\left(2+\sqrt{\ln d}\right), \]

which agrees with the bound in [8] on the dominant term (see also [2], [15]).

Our last bound is useless if $d \ge e^n$ or if $d$ is infinite. But whenever the norm
of the data has finite second moments we can use Corollary 3 and (4) to obtain

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{X}) \le \frac{2^{3/2}}{\sqrt{n}}\left(2+\sqrt{\ln \mathbb{E}\|X\|_2^2}\right) + \frac{2}{\sqrt{n}}. \]

For nontrivial results $\mathbb{E}\|X\|_2^2$ only needs to be subexponential in $n$.

We remark that a similar condition to equation (3) for the Lasso, replacing
the expectation with the supremum over $X$, has been considered within the
context of elastic net regularization [4].
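
For the Lasso the supremum over the $\ell_1$ unit ball is an $\ell_\infty$ norm, so the empirical Rademacher complexity equals $\frac{2}{n}\mathbb{E}\max_k|\sum_i \epsilon_i x_{ik}|$ and can be estimated by simulation. The sketch below compares such a Monte Carlo estimate with the data-dependent bound $\frac{2^{3/2}}{n}\sqrt{\sum_i\|x_i\|_\infty^2}(2+\sqrt{\ln d})$. The synthetic data, seeds, and function names are assumptions made for illustration.

```python
import math
import random

def lasso_rademacher_mc(sample, draws=500, seed=0):
    """Monte Carlo estimate of (2/n) E max_k |sum_i eps_i x_ik|."""
    rng = random.Random(seed)
    n, d = len(sample), len(sample[0])
    total = 0.0
    for _ in range(draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        # l_infinity norm of the signed sum of the data points
        total += max(abs(sum(e * x[k] for e, x in zip(eps, sample)))
                     for k in range(d))
    return 2.0 * total / (n * draws)

def lasso_bound(sample):
    """(2^{3/2}/n) sqrt(sum_i ||x_i||_inf^2) (2 + sqrt(ln d))."""
    n, d = len(sample), len(sample[0])
    s = sum(max(abs(c) for c in x) ** 2 for x in sample)
    return (2 ** 1.5 / n) * math.sqrt(s) * (2 + math.sqrt(math.log(d)))

# Synthetic data: n = 50 points in d = 20 dimensions, entries in [-1, 1].
data_rng = random.Random(1)
sample = [[data_rng.uniform(-1, 1) for _ in range(20)] for _ in range(50)]
estimate = lasso_rademacher_mc(sample)
bound = lasso_bound(sample)
```

The estimate is random, so only the comparison with the bound is meaningful, not its exact value.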



2.3 The Weighted Lasso 

The Lasso assigns an equal penalty to all regression coefficients, while there may
be a priori information on the respective significance of the different coordinates.
For this reason different weightings have been proposed (see e.g. [20]). In our
framework an appropriate set of operators is $\mathcal{M} = \{\alpha_1 P_1,\dots,\alpha_k P_k,\dots\}$, with
$\alpha_k > 0$, where $\alpha_k^{-1}$ is the penalty weight associated with the $k$-th coordinate.
Then

\[ \|\beta\|_{\mathcal{M}} = \sum_k \alpha_k^{-1}|\beta_k| \]

and

\[ \|z\|_{\mathcal{M}*} = \sup_k \alpha_k |z_k|. \]

To further illustrate the use of Corollary 3 let us assume that the underlying
space $H$ is infinite dimensional (i.e. $H = \ell_2(\mathbb{N})$), and make the compensating
assumption that $\alpha \in H$, i.e. $\sum_k \alpha_k^2 = R^2 < \infty$. For simplicity we also assume
that $\sup_k \alpha_k \le 1$. Then, if $\|X\|_\infty \le 1$ almost surely, we have both $\|X\|_{\mathcal{M}*} \le 1$
and $\mathbb{E}\sum_k \alpha_k^2 (X)_k^2 \le R^2$. Again we obtain

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{X}) \le \frac{2^{3/2}}{\sqrt{n}}\left(2+\sqrt{\ln R^2}\right) + \frac{2}{\sqrt{n}}. \]

So in this case the second moment bound is enforced by the weighting sequence.

2.4 The Group Lasso 

Let $H = \mathbb{R}^d$ and let $\{J_1,\dots,J_r\}$ be a partition of the index set $\{1,\dots,d\}$.
We take $\mathcal{M} = \{P_{J_1},\dots,P_{J_r}\}$ where $P_{J_\ell} = \sum_{i\in J_\ell} P_i$ is the projection onto the
subspace spanned by the basis vectors $e_i$, $i \in J_\ell$. The ranges of the $P_{J_\ell}$ then provide an
orthogonal decomposition of $\mathbb{R}^d$ and the above mentioned simplifications also
apply. We get

\[ \|\beta\|_{\mathcal{M}} = \sum_{\ell=1}^r \|P_{J_\ell}\beta\| \]

and

\[ \|z\|_{\mathcal{M}*} = \max_{\ell=1,\dots,r} \|P_{J_\ell}z\|. \]

The algorithm which uses $\|\cdot\|_{\mathcal{M}}$ as a regularizer is called the group Lasso (see
e.g. [22]). It encourages vectors $\beta$ whose support lies in the union of a small
number of groups of coordinate indices. If we know that $\|P_{J_\ell}X\| \le 1$ almost
surely for all $\ell \in \{1,\dots,r\}$ then we get

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{X}) \le \frac{2^{3/2}}{\sqrt{n}}\left(2+\sqrt{\ln r}\right), \tag{5} \]

in complete symmetry with the Lasso and essentially the same as given in [8].
If $r$ is prohibitively large or if different penalties are desired for different groups,
the same remarks apply as in the previous two sections. Just as in the case
of the Lasso the second moment condition (3) translates to the simple form
$\mathbb{E}\|X\|^2 < \infty$.
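
For a partition into groups, the group Lasso norm is the sum of the Euclidean norms of the group blocks, its dual is the largest block norm, and the resulting bound is $\frac{2^{3/2}}{\sqrt{n}}(2+\sqrt{\ln r})$ for $r$ groups. The following sketch spells this out; the function names, the 0-based index groups, and the example vectors are illustrative assumptions.

```python
import math

def block_norm(v, group):
    """Euclidean norm of the coordinates of v indexed by group."""
    return math.sqrt(sum(v[k] ** 2 for k in group))

def group_norm(beta, groups):
    """Group Lasso norm: sum over groups of the block norms of beta."""
    return sum(block_norm(beta, J) for J in groups)

def group_dual_norm(z, groups):
    """Dual norm: largest block norm of z."""
    return max(block_norm(z, J) for J in groups)

def group_lasso_bound(r, n):
    """(2^{3/2} / sqrt(n)) (2 + sqrt(ln r)), valid when every block norm
    of a data point is at most 1 almost surely."""
    return (2 ** 1.5 / math.sqrt(n)) * (2 + math.sqrt(math.log(r)))

# Hypothetical partition of {0, ..., 4} into r = 3 groups.
groups = [[0, 1], [2], [3, 4]]
beta = [3.0, 4.0, 0.0, 0.0, 0.0]
z = [1.0, 0.0, 2.0, 0.0, 0.0]
inner = sum(b * c for b, c in zip(beta, z))
```

With singleton groups the two norms reduce to the $\ell_1$ and $\ell_\infty$ norms of the Lasso.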

2.5 Overlapping Groups 

In the previous examples the members of $\mathcal{M}$ always had mutually orthogonal
ranges, which gave a simple appearance to the norm $\|\beta\|_{\mathcal{M}}$. If the ranges are
not mutually orthogonal, the norm has a more complicated form. For example,
in the group Lasso setting, if the groups $J_\ell$ cover $\{1,\dots,d\}$, but are not disjoint,
we obtain the regularizer of [6], given by

\[ \Omega_{\mathrm{overlap}}(\beta) = \inf\Big\{ \sum_{\ell=1}^r \|v_\ell\| \;:\; (v_\ell)_k = 0 \text{ if } k \notin J_\ell \text{ and } \beta = \sum_{\ell=1}^r v_\ell \Big\}. \]

If $\|P_{J_\ell}X\| \le 1$ almost surely for all $\ell \in \{1,\dots,r\}$ then the Rademacher
complexity of the set of linear functionals with $\Omega_{\mathrm{overlap}}(\beta) \le 1$ is bounded as in (5),
in complete equivalence to the bound for the group Lasso.

The same bound also holds for the class satisfying $\Omega_{\mathrm{group}}(\beta) \le 1$, where the
function $\Omega_{\mathrm{group}}$ is defined, for every $\beta \in \mathbb{R}^d$, as

\[ \Omega_{\mathrm{group}}(\beta) = \sum_{\ell=1}^r \|P_{J_\ell}\beta\|, \]

which has been proposed by [3], [23]. To see this we only have to show that
$\Omega_{\mathrm{overlap}} \le \Omega_{\mathrm{group}}$, which is accomplished by generating a disjoint partition
$\{J'_\ell\}_{\ell=1}^r$ where $J'_\ell \subseteq J_\ell$, writing $\beta = \sum_{\ell=1}^r P_{J'_\ell}\beta$ and realizing that
$\|P_{J'_\ell}\beta\| \le \|P_{J_\ell}\beta\|$. The bound obtained from this simple comparison may however be quite
loose.

2.6 Regularizers Generated from Cones 

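Our next example considers structured sparsity regularizers as in [16]. Let $\Lambda$
be a nonempty subset of the open positive orthant in $\mathbb{R}^d$ and define a function
$\Omega_\Lambda : \mathbb{R}^d \to \mathbb{R}$ by

\[ \Omega_\Lambda(\beta) = \frac{1}{2}\inf_{\lambda\in\Lambda}\sum_{j=1}^d\Big(\frac{\beta_j^2}{\lambda_j}+\lambda_j\Big). \]

If $\Lambda$ is a convex cone, then it is shown in [17] that $\Omega_\Lambda$ is a norm and that the
dual norm is given by

\[ \|z\|_{\Lambda*} = \sup\Big\{\Big(\sum_{j=1}^d \mu_j z_j^2\Big)^{1/2} \;:\; \mu = \lambda/\|\lambda\|_1 \text{ with } \lambda \in \Lambda\Big\}. \]

The supremum in this formula is evidently attained on the set $\mathcal{E}(\Lambda)$ of extreme
points of the closure of $\{\lambda/\|\lambda\|_1 : \lambda \in \Lambda\}$. For $\mu \in \mathcal{E}(\Lambda)$ let $M_\mu$ be the diagonal
matrix with entries $\sqrt{\mu_j}\,\delta_{ji}$ and let $\mathcal{M}_\Lambda$ be the collection of matrices $\mathcal{M}_\Lambda =
\{M_\mu : \mu \in \mathcal{E}(\Lambda)\}$. Then

\[ \|z\|_{\Lambda*} = \sup_{M\in\mathcal{M}_\Lambda} \|Mz\|. \]

Clearly $\mathcal{M}_\Lambda$ is uniformly bounded in the operator norm, so if $\Lambda$ is a cone and
$\mathcal{E}(\Lambda)$ is at most countable, then $\|\cdot\|_{\Lambda*} = \|\cdot\|_{\mathcal{M}_\Lambda *}$, $\Omega_\Lambda = \|\cdot\|_{\mathcal{M}_\Lambda}$, and our bounds
apply. If $\mathcal{E}(\Lambda)$ is finite and $\mathbf{x}$ is a sample then the Rademacher complexity of
the class with $\Omega_\Lambda(\beta) \le 1$ is bounded by

\[ \frac{2^{3/2}}{n}\sqrt{\sum_{i=1}^n \|x_i\|_{\Lambda*}^2}\,\left(2+\sqrt{\ln|\mathcal{E}(\Lambda)|}\right). \]
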
2.7 Kernel Learning 

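This is the most general case to which the simplification applies: Suppose that
$H$ is the direct sum $H = \oplus_{j\in\mathcal{J}} H_j$ of an at most countable number of Hilbert
spaces $H_j$. We set $\mathcal{M} = \{P_j\}_{j\in\mathcal{J}}$, where $P_j : H \to H$ is the projection on $H_j$.
Then

\[ \|\beta\|_{\mathcal{M}} = \sum_{j\in\mathcal{J}} \|P_j\beta\| \]

and

\[ \|z\|_{\mathcal{M}*} = \sup_{j\in\mathcal{J}} \|P_j z\|. \]

Such a situation arises in multiple kernel learning [10] or the nonparametric
group Lasso [13] in the following way: One has an input space $\mathcal{X}$ and a collection
$\{K_j\}_{j\in\mathcal{J}}$ of positive definite kernels $K_j : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. Let $\phi_j : \mathcal{X} \to H_j$ be the
feature map representation associated with kernel $K_j$, so that, for every $x,t \in \mathcal{X}$,
$K_j(x,t) = \langle\phi_j(x),\phi_j(t)\rangle$ (see the literature on kernel methods for background).

Suppose that $\mathbf{x} = (x_1,\dots,x_n) \in \mathcal{X}^n$ is a sample. Define the kernel matrix
$\mathbf{K}_j = (K_j(x_i,x_k))_{i,k=1}^n$. Using this notation the bound in Theorem 2 reads

\[ \mathcal{R}_{\mathcal{M}}\big((\phi(x_1),\dots,\phi(x_n))\big) \le \frac{2^{3/2}}{n}\sqrt{\sup_{j\in\mathcal{J}}\mathrm{tr}\,\mathbf{K}_j}\,\left(2+\sqrt{\ln\frac{\sum_{j\in\mathcal{J}}\mathrm{tr}\,\mathbf{K}_j}{\sup_{j\in\mathcal{J}}\mathrm{tr}\,\mathbf{K}_j}}\right). \]

In particular, if $\mathcal{J}$ is finite and $K_j(x,x) \le 1$ for every $x \in \mathcal{X}$ and $j \in \mathcal{J}$, then
the bound reduces to

\[ \frac{2^{3/2}}{\sqrt{n}}\left(2+\sqrt{\ln|\mathcal{J}|}\right), \]

essentially in agreement with [3] and related bounds. For infinite or prohibitively large $\mathcal{J}$ the
second moment condition now becomes

\[ \mathbb{E}\sum_{j\in\mathcal{J}} K_j(X,X) < \infty. \]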
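
The trace form of the bound depends on the kernels only through the traces of their kernel matrices, so it is cheap to evaluate. The sketch below does so for a hypothetical family whose traces decay like $n/j^2$ (an assumption chosen only to illustrate a summable family) and compares the result with the uniform bound $\frac{2^{3/2}}{\sqrt{n}}(2+\sqrt{\ln|\mathcal{J}|})$.

```python
import math

def mkl_trace_bound(traces, n):
    """(2^{3/2}/n) sqrt(max_j tr K_j) (2 + sqrt(ln(sum_j tr K_j / max_j tr K_j))).

    traces : list of kernel-matrix traces tr K_j (at least one positive)
    n      : sample size
    """
    top = max(traces)
    return (2 ** 1.5 / n) * math.sqrt(top) * (
        2 + math.sqrt(math.log(sum(traces) / top)))

def mkl_uniform_bound(num_kernels, n):
    """(2^{3/2}/sqrt(n)) (2 + sqrt(ln |J|)), valid when K_j(x, x) <= 1."""
    return (2 ** 1.5 / math.sqrt(n)) * (2 + math.sqrt(math.log(num_kernels)))

# Hypothetical traces tr K_j = n / j^2 for j = 1, ..., 10000.
n = 200
traces = [n / j ** 2 for j in range(1, 10001)]
data_dependent = mkl_trace_bound(traces, n)
uniform = mkl_uniform_bound(len(traces), n)
```

Because the traces are summable, the data-dependent value stays bounded as more kernels are added, while the uniform bound keeps growing with $\sqrt{\ln|\mathcal{J}|}$.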



3 Proofs 

We first give some notation and auxiliary results, then we prove the results 
announced in the introduction. 



3.1 Notation and Auxiliary Results 

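The Hilbert space $H$ and the collection $\mathcal{M}$ are fixed throughout the following,
as is the sample size $n \in \mathbb{N}$.

Recall that $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ denote the norm and inner product in $H$,
respectively. For a linear transformation $M : \mathbb{R}^n \to H$ the Hilbert-Schmidt norm is
defined as

\[ \|M\|_{HS} = \Big(\sum_{i=1}^n \|Me_i\|^2\Big)^{1/2}, \]

where $\{e_i : 1 \le i \le n\}$ is the canonical basis of $\mathbb{R}^n$.

We use bold letters ($\mathbf{x}$, $\mathbf{X}$, $\boldsymbol{\epsilon}$, ...) to denote $n$-tuples of objects, such as
vectors or random variables.

Let $\mathcal{X}$ be any space. For $\mathbf{x} = (x_1,\dots,x_n) \in \mathcal{X}^n$, $1 \le k \le n$ and $y \in \mathcal{X}$ we
use $\mathbf{x}_{k\leftarrow y}$ to denote the object obtained from $\mathbf{x}$ by replacing the $k$-th coordinate
of $\mathbf{x}$ with $y$. That is

\[ \mathbf{x}_{k\leftarrow y} = (x_1,\dots,x_{k-1},y,x_{k+1},\dots,x_n). \]

The following concentration inequality, known as the bounded difference
inequality (see McDiarmid [13]), goes back to the work of Hoeffding. We only
need it in the weak form stated below.

Theorem 4 Let $F : \mathcal{X}^n \to \mathbb{R}$ and write

\[ B^2 = \sum_{k=1}^n\ \sup_{y_1,y_2\in\mathcal{X},\ \mathbf{x}\in\mathcal{X}^n} \big(F(\mathbf{x}_{k\leftarrow y_1}) - F(\mathbf{x}_{k\leftarrow y_2})\big)^2. \]

Let $\mathbf{X} = (X_1,\dots,X_n)$ be a vector of independent random variables with values
in $\mathcal{X}$, and let $\mathbf{X}'$ be iid to $\mathbf{X}$. Then for any $t > 0$

\[ \Pr\{F(\mathbf{X}) > \mathbb{E}F(\mathbf{X}') + t\} \le e^{-2t^2/B^2}. \]

Finally we need a simple lemma on the normal approximation:

Lemma 5 Let $a,\delta > 0$. Then

\[ \int_\delta^\infty \exp\Big(\frac{-t^2}{2a^2}\Big)dt \le \frac{a^2}{\delta}\exp\Big(\frac{-\delta^2}{2a^2}\Big). \]

Proof. For $t \ge \delta/a$ we have $1 \le at/\delta$. Thus

\[ \int_\delta^\infty \exp\Big(\frac{-t^2}{2a^2}\Big)dt = a\int_{\delta/a}^\infty e^{-t^2/2}dt \le \frac{a^2}{\delta}\int_{\delta/a}^\infty t\,e^{-t^2/2}dt = \frac{a^2}{\delta}\exp\Big(\frac{-\delta^2}{2a^2}\Big). \]

■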
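
Reading Lemma 5 as the standard Gaussian tail estimate $\int_\delta^\infty e^{-t^2/(2a^2)}dt \le \frac{a^2}{\delta}e^{-\delta^2/(2a^2)}$, it can be checked numerically, since the left-hand side has the closed form $a\sqrt{\pi/2}\,\mathrm{erfc}(\delta/(a\sqrt{2}))$. The sketch below is a sanity check on a small grid, not a proof; the function names are illustrative.

```python
import math

def gaussian_tail_integral(a, delta):
    """Closed form of the integral of exp(-t^2 / (2 a^2)) over [delta, inf)."""
    return a * math.sqrt(math.pi / 2) * math.erfc(delta / (a * math.sqrt(2)))

def lemma5_bound(a, delta):
    """Upper bound (a^2 / delta) * exp(-delta^2 / (2 a^2))."""
    return (a * a / delta) * math.exp(-delta * delta / (2 * a * a))

# Compare both sides on a small grid of parameters.
grid = [(a, delta) for a in (0.5, 1.0, 3.0) for delta in (0.1, 1.0, 5.0)]
holds = all(gaussian_tail_integral(a, d) <= lemma5_bound(a, d) for a, d in grid)
```

The bound is loose for small $\delta/a$ and becomes tight as $\delta/a$ grows, which is the regime used later in the proof of Lemma 11.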

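3.2 Properties of $\|\cdot\|_{\mathcal{M}}$ and Duality

We state again the general conditions on the set $\mathcal{M}$.

Condition 6 $\mathcal{M}$ is a finite or countably infinite set of symmetric bounded linear
operators on a real separable Hilbert space $H$ such that:

(a) For every $x \in H$ with $x \neq 0$, there exists $M \in \mathcal{M}$ such that $Mx \neq 0$;

(b) $\sup_{M\in\mathcal{M}} |||M||| < \infty$, where $|||\cdot|||$ is the operator norm.

Denote $\mathcal{V}(\mathcal{M}) = \{v : v = (v_M)_{M\in\mathcal{M}},\ v_M \in H\}$, so the definition of $\|\cdot\|_{\mathcal{M}}$
reads

\[ \|\beta\|_{\mathcal{M}} = \inf\Big\{\sum_{M\in\mathcal{M}} \|v_M\| \;:\; v \in \mathcal{V}(\mathcal{M}) \text{ and } \sum_{M\in\mathcal{M}} Mv_M = \beta\Big\}. \]

We write $\ell_1(\mathcal{M})$ for the set of $\beta \in H$ on which $\|\beta\|_{\mathcal{M}}$ is finite.
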

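Theorem 7 We have that:

(i) $\|\cdot\|_{\mathcal{M}}$ is positive homogeneous and subadditive on $\ell_1(\mathcal{M})$;

(ii) $\ell_1(\mathcal{M})$ is a dense subspace of $H$. If $\mathcal{M}$ is finite or $H$ is finite dimensional
then $\ell_1(\mathcal{M}) = H$;

(iii) $\|\cdot\|_{\mathcal{M}}$ is a norm on $\ell_1(\mathcal{M})$.

Proof. (i) Positive homogeneity of $\|\cdot\|_{\mathcal{M}}$ is clear. For subadditivity let $\beta,\gamma \in
\ell_1(\mathcal{M})$. Let $\epsilon > 0$ be arbitrary and choose $w^\beta, w^\gamma \in \mathcal{V}(\mathcal{M})$ such that
$\sum_{M\in\mathcal{M}} Mw^\beta_M = \beta$, $\sum_{M\in\mathcal{M}} Mw^\gamma_M = \gamma$,

\[ \sum_{M\in\mathcal{M}} \|w^\beta_M\| \le \|\beta\|_{\mathcal{M}} + \epsilon \quad\text{and}\quad \sum_{M\in\mathcal{M}} \|w^\gamma_M\| \le \|\gamma\|_{\mathcal{M}} + \epsilon. \]

Then $w^\beta + w^\gamma \in \mathcal{V}(\mathcal{M})$ and $\sum_{M\in\mathcal{M}} M(w^\beta_M + w^\gamma_M) = \beta + \gamma$. Thus $w^\beta + w^\gamma$
is in the feasible set for the definition of $\|\beta+\gamma\|_{\mathcal{M}}$, so

\[ \|\beta+\gamma\|_{\mathcal{M}} = \inf\Big\{\sum_{M\in\mathcal{M}} \|v_M\| : v \in \mathcal{V}(\mathcal{M}) \text{ and } \sum_{M\in\mathcal{M}} Mv_M = \beta+\gamma\Big\} \]
\[ \le \sum_{M\in\mathcal{M}} \|w^\beta_M + w^\gamma_M\| \le \sum_{M\in\mathcal{M}} \|w^\beta_M\| + \sum_{M\in\mathcal{M}} \|w^\gamma_M\| \le \|\beta\|_{\mathcal{M}} + \|\gamma\|_{\mathcal{M}} + 2\epsilon. \]

Since $\epsilon$ was arbitrary subadditivity follows.

(ii) It follows from (i) that $\ell_1(\mathcal{M})$ is a linear subspace of $H$. Let $S$ be the
set of finite linear combinations of the form

\[ S = \Big\{\sum_{i=1}^K M_i v_i \;:\; K \in \mathbb{N},\ M_i \in \mathcal{M},\ v_i \in H\Big\}. \]

Then $S$ is a linear subspace of $\ell_1(\mathcal{M})$ and contains all vectors of the form
$MMv = M^2v$ where $M \in \mathcal{M}$ and $v \in H$. If $x \in H$ is perpendicular to all of
$S$ then for all $M \in \mathcal{M}$ we must have $x \perp MMx$, hence $Mx = 0$, which implies
$x = 0$ by condition (a). This shows that $S$ and therefore also $\ell_1(\mathcal{M})$ are dense
in $H$. The second assertion of (ii) is an easy consequence of the first.

(iii) Suppose $\beta \in \ell_1(\mathcal{M})$, $\beta \neq 0$ and $\beta = \sum_{M\in\mathcal{M}} Mv_M$ with $v \in \mathcal{V}(\mathcal{M})$. Then

\[ 0 < \|\beta\| = \Big\|\sum_{M\in\mathcal{M}} Mv_M\Big\| \le \sup_{M\in\mathcal{M}} |||M||| \sum_{M\in\mathcal{M}} \|v_M\|. \]

Taking the infimum on the right hand side we obtain that

\[ 0 < \|\beta\| \le \sup_{M\in\mathcal{M}} |||M|||\ \|\beta\|_{\mathcal{M}}, \]

where condition (b) was used. Together with (i) this implies that $\|\cdot\|_{\mathcal{M}}$ is a
norm on $\ell_1(\mathcal{M})$. ■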

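From now on we refer to $\ell_1(\mathcal{M})$ as the normed linear space with norm $\|\cdot\|_{\mathcal{M}}$.

Theorem 8 Let $z \in H$. The linear functional $\beta \mapsto \langle\beta,z\rangle$ is bounded on $\ell_1(\mathcal{M})$
and has norm

\[ \|z\|_{\mathcal{M}*} = \sup_{M\in\mathcal{M}} \|Mz\|. \]

Proof. Let $F$ be the dual norm. By definition

\[ F(z) = \inf\{s : s\|\beta\|_{\mathcal{M}} - \langle\beta,z\rangle \ge 0,\ \forall\beta \in H\} \]
\[ = \inf\Big\{s : \sum_{M\in\mathcal{M}} \big(s\|v_M\| - \langle Mv_M,z\rangle\big) \ge 0,\ \forall v \in \mathcal{V}(\mathcal{M})\Big\} \]
\[ = \inf\{s : s\|v\| - \langle Mv,z\rangle \ge 0,\ \forall v \in H,\ \forall M \in \mathcal{M}\} \]
\[ = \inf\{s : s \ge \langle v,Mz\rangle,\ \forall v \in H,\ \|v\| = 1,\ \forall M \in \mathcal{M}\} \]
\[ = \inf\{s : s \ge \|Mz\|,\ \forall M \in \mathcal{M}\} \]
\[ = \sup_{M\in\mathcal{M}} \|Mz\| = \|z\|_{\mathcal{M}*}. \]

■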

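Proposition 9 If the ranges of the members of $\mathcal{M}$ are mutually orthogonal
then for $\beta \in \ell_1(\mathcal{M})$

\[ \|\beta\|_{\mathcal{M}} = \sum_{M\in\mathcal{M}} \|M^+\beta\|, \]

where $M^+$ is the pseudoinverse of $M$.

Proof. The ranges of the members of $\mathcal{M}$ provide an orthogonal decomposition
of $H$, so

\[ \beta = \sum_{M\in\mathcal{M}} M(M^+\beta), \]

where we used the fact that $MM^+$ is the orthogonal projection onto the range
of $M$. Taking $v_M = M^+\beta$ this implies that $\|\beta\|_{\mathcal{M}} \le \sum_{M\in\mathcal{M}} \|M^+\beta\|$. On the
other hand, if $\beta = \sum_{N\in\mathcal{M}} Nv_N$, then, applying $M^+$ to this identity we see that
$M^+Mv_M = M^+\beta$ for all $M$, so

\[ \sum_{M\in\mathcal{M}} \|M^+\beta\| = \sum_{M\in\mathcal{M}} \|M^+Mv_M\| \le \sum_{M\in\mathcal{M}} \|v_M\|, \]

which shows the reverse inequality. ■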
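3.3 Data and Distribution Dependent Bounds

We use the bounded difference inequality to derive a concentration inequality
for linearly transformed random vectors.

Lemma 10 Let $\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_n)$ be a vector of independent real random
variables with $-1 \le \epsilon_i \le 1$, and $\boldsymbol{\epsilon}'$ iid to $\boldsymbol{\epsilon}$. Suppose that $M$ is a linear
transformation $M : \mathbb{R}^n \to H$.

(i) Then for $t > 0$ we have

\[ \Pr\{\|M\boldsymbol{\epsilon}\| \ge \mathbb{E}\|M\boldsymbol{\epsilon}'\| + t\} \le \exp\Big(\frac{-t^2}{2\|M\|_{HS}^2}\Big). \]

(ii) If $\boldsymbol{\epsilon}$ is orthonormal (satisfying $\mathbb{E}\epsilon_i\epsilon_j = \delta_{ij}$), then

\[ \mathbb{E}\|M\boldsymbol{\epsilon}\| \le \|M\|_{HS} \]

and, for every $r > 0$,

\[ \Pr\{\|M\boldsymbol{\epsilon}\| > t\} \le e^{1/r}\exp\Big(\frac{-t^2}{(2+r)\|M\|_{HS}^2}\Big). \tag{6} \]

Proof. (i) Define $F : [-1,1]^n \to \mathbb{R}$ by $F(\mathbf{x}) = \|M\mathbf{x}\|$. By the triangle inequality

\[ \sum_{k=1}^n\ \sup_{y_1,y_2\in[-1,1],\ \mathbf{x}\in[-1,1]^n} \big(F(\mathbf{x}_{k\leftarrow y_1}) - F(\mathbf{x}_{k\leftarrow y_2})\big)^2 \]
\[ \le \sum_{k=1}^n\ \sup_{y_1,y_2\in[-1,1],\ \mathbf{x}\in[-1,1]^n} \|M(\mathbf{x}_{k\leftarrow y_1} - \mathbf{x}_{k\leftarrow y_2})\|^2 \]
\[ = \sum_{k=1}^n\ \sup_{y_1,y_2\in[-1,1]} (y_1-y_2)^2\|Me_k\|^2 \le 4\|M\|_{HS}^2. \]

The result now follows from the bounded difference inequality (Theorem 4).

(ii) If $\boldsymbol{\epsilon}$ is orthonormal then it follows from Jensen's inequality that

\[ \mathbb{E}\|M\boldsymbol{\epsilon}\| \le \big(\mathbb{E}\|M\boldsymbol{\epsilon}\|^2\big)^{1/2} = \Big(\sum_{i,j=1}^n \mathbb{E}[\epsilon_i\epsilon_j]\,\langle Me_i,Me_j\rangle\Big)^{1/2} = \|M\|_{HS}. \]

For the second assertion of (ii) first note that from calculus we get $(t-1)^2/2 -
t^2/(2+r) \ge -1/r$ for all $t \in \mathbb{R}$. This implies that

\[ e^{-(t-1)^2/2} \le e^{1/r}\,e^{-t^2/(2+r)}. \tag{7} \]

Since $1/r \ge 1/(2+r)$ the inequality to be proved is trivial for $t \le \|M\|_{HS}$. If
$t > \|M\|_{HS}$ then, using $\mathbb{E}\|M\boldsymbol{\epsilon}\| \le \|M\|_{HS}$, we have $t - \mathbb{E}\|M\boldsymbol{\epsilon}\| \ge t - \|M\|_{HS} >
0$, so by part (i) and (7) we obtain

\[ \Pr\{\|M\boldsymbol{\epsilon}\| > t\} = \Pr\{\|M\boldsymbol{\epsilon}\| > \mathbb{E}\|M\boldsymbol{\epsilon}\| + (t - \mathbb{E}\|M\boldsymbol{\epsilon}\|)\} \]
\[ \le \exp\Big(\frac{-(t-\mathbb{E}\|M\boldsymbol{\epsilon}\|)^2}{2\|M\|_{HS}^2}\Big) \le \exp\Big(\frac{-(t-\|M\|_{HS})^2}{2\|M\|_{HS}^2}\Big) \]
\[ = \exp\Big(\frac{-(t/\|M\|_{HS}-1)^2}{2}\Big) \le e^{1/r}\,e^{-(t/\|M\|_{HS})^2/(2+r)} \]
\[ = e^{1/r}\exp\Big(\frac{-t^2}{(2+r)\|M\|_{HS}^2}\Big). \]

■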
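
The calculus step used for the second assertion of part (ii), namely $(t-1)^2/2 - t^2/(2+r) \ge -1/r$ for all real $t$, can be checked directly: the left-hand side is a convex quadratic in $t$ with minimizer $t^* = (2+r)/r$ and minimum value exactly $-1/r$. The following sketch verifies this numerically on a grid; the function name and grid are illustrative.

```python
def gap(t, r):
    """(t - 1)^2 / 2 - t^2 / (2 + r), claimed to be >= -1/r for all real t."""
    return (t - 1) ** 2 / 2 - t ** 2 / (2 + r)

def minimizer(r):
    """Stationary point of gap in t: solve t - 1 - 2 t / (2 + r) = 0."""
    return (2 + r) / r

# Check the minimum value and the inequality on a coarse grid of t values.
checks = []
for r in (0.5, 1.0, 2.0, 10.0):
    at_min = gap(minimizer(r), r)
    on_grid = min(gap(k / 10.0, r) for k in range(-200, 201))
    checks.append((r, at_min, on_grid))
```

The choice $r = 2$ used in Section 4 gives the constant $e^{1/2}$ and the exponent $-t^2/(4\|M\|_{HS}^2)$.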



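We now use integration by parts, a union bound and the above concentration
inequality to derive a bound on the expectation of the supremum of the norms
$\|M\boldsymbol{\epsilon}\|$. This is the essential step in the proof of Theorem 2. It is by no means
a new technique, in fact it appears many times in the book by Ledoux and
Talagrand [11], but compared to the combinatorial approach in [3] it seems
more suited to the study of the problem at hand, and gives insights into the fine
structure of the logarithmic factor appearing in bounds for Lasso-like methods.

Lemma 11 Let $\mathcal{M}$ be a finite or countably infinite set of linear transformations
$M : \mathbb{R}^n \to H$ and $\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_n)$ a vector of orthonormal random variables
(satisfying $\mathbb{E}\epsilon_i\epsilon_j = \delta_{ij}$) with values in $[-1,1]$. Then

\[ \mathbb{E}\sup_{M\in\mathcal{M}} \|M\boldsymbol{\epsilon}\| \le \sqrt{2}\,\sup_{M\in\mathcal{M}}\|M\|_{HS}\left(2+\sqrt{\ln\frac{\sum_{M\in\mathcal{M}}\|M\|_{HS}^2}{\sup_{M\in\mathcal{M}}\|M\|_{HS}^2}}\right). \]

Proof. To lighten notation we abbreviate $\mathcal{M}_\infty := \sup_{M\in\mathcal{M}} \|M\|_{HS}$ below. We
now use integration by parts

\[ \mathbb{E}\sup_{M\in\mathcal{M}} \|M\boldsymbol{\epsilon}\| = \int_0^\infty \Pr\Big\{\sup_{M\in\mathcal{M}}\|M\boldsymbol{\epsilon}\| > t\Big\}dt \]
\[ \le \mathcal{M}_\infty + \delta + \int_{\mathcal{M}_\infty+\delta}^\infty \Pr\Big\{\sup_{M\in\mathcal{M}}\|M\boldsymbol{\epsilon}\| > t\Big\}dt \]
\[ \le \mathcal{M}_\infty + \delta + \sum_{M\in\mathcal{M}}\int_{\mathcal{M}_\infty+\delta}^\infty \Pr\{\|M\boldsymbol{\epsilon}\| > t\}dt, \]

where we have introduced a parameter $\delta > 0$. The first inequality above follows
from the fact that probabilities never exceed 1, and the second from a union
bound. Now for any $M \in \mathcal{M}$ we can make a change of variables and use
Lemma 10 (ii), which gives $\mathbb{E}\|M\boldsymbol{\epsilon}\| \le \|M\|_{HS} \le \mathcal{M}_\infty$, so that

\[ \int_{\mathcal{M}_\infty+\delta}^\infty \Pr\{\|M\boldsymbol{\epsilon}\| > t\}dt \le \int_\delta^\infty \Pr\{\|M\boldsymbol{\epsilon}\| > \mathbb{E}\|M\boldsymbol{\epsilon}\| + t\}dt \]
\[ \le \int_\delta^\infty \exp\Big(\frac{-t^2}{2\|M\|_{HS}^2}\Big)dt \le \frac{\|M\|_{HS}^2}{\delta}\exp\Big(\frac{-\delta^2}{2\|M\|_{HS}^2}\Big), \]

where the second inequality follows from Lemma 10 (i), and the third from
Lemma 5. Substitution in the previous chain of inequalities and using Hoelder's
inequality (in the $\ell_1/\ell_\infty$-version) give

\[ \mathbb{E}\sup_{M\in\mathcal{M}} \|M\boldsymbol{\epsilon}\| \le \mathcal{M}_\infty + \delta + \frac{1}{\delta}\Big(\sum_{M\in\mathcal{M}}\|M\|_{HS}^2\Big)\exp\Big(\frac{-\delta^2}{2\mathcal{M}_\infty^2}\Big). \tag{8} \]

We now set

\[ \delta = \mathcal{M}_\infty\sqrt{2\ln\Big(\frac{e\sum_{M\in\mathcal{M}}\|M\|_{HS}^2}{\mathcal{M}_\infty^2}\Big)}. \]

Then $\delta > 0$ as required. The substitution makes the last term in (8) smaller
than $\mathcal{M}_\infty/(e\sqrt{2})$, and since $1 + 1/(e\sqrt{2}) < \sqrt{2}$, we obtain

\[ \mathbb{E}\sup_{M\in\mathcal{M}} \|M\boldsymbol{\epsilon}\| \le \sqrt{2}\,\mathcal{M}_\infty\left(1+\sqrt{\ln\Big(\frac{e\sum_{M\in\mathcal{M}}\|M\|_{HS}^2}{\mathcal{M}_\infty^2}\Big)}\right). \]

Finally we use $\sqrt{\ln(es)} \le 1 + \sqrt{\ln s}$ for $s \ge 1$. ■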
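
The last part of the proof can be traced numerically. Writing $m$ for the largest Hilbert-Schmidt norm and $S$ for the sum of the squared norms, the sketch below takes the right-hand side of (8) to be $m + \delta + \frac{S}{\delta}e^{-\delta^2/(2m^2)}$ with $\delta = m\sqrt{2\ln(eS/m^2)}$, and compares it with the final bound $\sqrt{2}\,m\,(2+\sqrt{\ln(S/m^2)})$. The function name and the example norm sequences are assumptions for illustration.

```python
import math

def lemma11_terms(hs_norms):
    """Return (rhs8, final) for a list of positive Hilbert-Schmidt norms.

    rhs8  : m + delta + (S / delta) * exp(-delta^2 / (2 m^2)) at the chosen delta
    final : sqrt(2) * m * (2 + sqrt(ln(S / m^2)))
    """
    m = max(hs_norms)
    s = sum(h * h for h in hs_norms)
    delta = m * math.sqrt(2 * math.log(math.e * s / (m * m)))
    rhs8 = m + delta + (s / delta) * math.exp(-delta * delta / (2 * m * m))
    final = math.sqrt(2) * m * (2 + math.sqrt(math.log(s / (m * m))))
    return rhs8, final

# Hypothetical norm sequences: a single operator, equal norms, decaying norms.
examples = [
    [1.0],
    [2.0] * 50,
    [1.0 / k for k in range(1, 1001)],
]
results = [lemma11_terms(h) for h in examples]
```

In each case the intermediate quantity stays below the final bound, consistent with the two elementary inequalities used at the end of the proof.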

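Proof of Theorem 2 Let $\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_n)$ be a vector of iid Rademacher
variables. For $M \in \mathcal{M}$ we use $M_{\mathbf{x}}$ to denote the linear transformation $M_{\mathbf{x}} :
\mathbb{R}^n \to H$ given by $M_{\mathbf{x}}y = \sum_i (Mx_i)y_i$. We have

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) = \frac{2}{n}\,\mathbb{E}\sup_{\beta:\|\beta\|_{\mathcal{M}}\le 1}\Big\langle\beta,\sum_{i=1}^n \epsilon_i x_i\Big\rangle \le \frac{2}{n}\,\mathbb{E}\Big\|\sum_{i=1}^n \epsilon_i x_i\Big\|_{\mathcal{M}*} = \frac{2}{n}\,\mathbb{E}\sup_{M\in\mathcal{M}}\|M_{\mathbf{x}}\boldsymbol{\epsilon}\|. \]

Applying Lemma 11 to the set of transformations $\mathcal{M}_{\mathbf{x}} = \{M_{\mathbf{x}} : M \in \mathcal{M}\}$ gives

\[ \mathcal{R}_{\mathcal{M}}(\mathbf{x}) \le \frac{2^{3/2}}{n}\,\sup_{M\in\mathcal{M}}\|M_{\mathbf{x}}\|_{HS}\left(2+\sqrt{\ln\frac{\sum_{M\in\mathcal{M}}\|M_{\mathbf{x}}\|_{HS}^2}{\sup_{M\in\mathcal{M}}\|M_{\mathbf{x}}\|_{HS}^2}}\right). \]

Substitution of $\|M_{\mathbf{x}}\|_{HS}^2 = \sum_{i=1}^n \|Mx_i\|^2$ gives the first inequality of Theorem 2
and

\[ \sup_{M\in\mathcal{M}}\|M_{\mathbf{x}}\|_{HS}^2 \le \sum_{i=1}^n \sup_{M\in\mathcal{M}}\|Mx_i\|^2 = \sum_{i=1}^n \|x_i\|_{\mathcal{M}*}^2 \]

gives the second inequality. ■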

Proof of Corollary 3 From calculus we find that $t\ln t \ge -1/e$ for all $t > 0$.
For $A,B > 0$ and $n \in \mathbb{N}$ this implies that

\[ A\ln\frac{B}{A} = n\big[(A/n)\ln(B/n) - (A/n)\ln(A/n)\big] \le A\ln(B/n) + n/e. \tag{9} \]

Now multiply out the first inequality of Theorem 2 and use (9) with

\[ A = \sup_{M\in\mathcal{M}}\sum_{i=1}^n\|Mx_i\|^2 \quad\text{and}\quad B = \sum_{M\in\mathcal{M}}\sum_{i=1}^n\|Mx_i\|^2. \]

Finally use $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ for $a,b \ge 0$ and the fact that $2^{3/2}/\sqrt{e} \le 2$. ■
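
The elementary inequality behind this proof, read here as $A\ln(B/A) \le A\ln(B/n) + n/e$ for $A, B > 0$, follows from $t\ln t \ge -1/e$ with $t = A/n$ and is tight at $A = n/e$. A small numerical sketch with illustrative function names and values:

```python
import math

def lhs9(A, B):
    """Left-hand side A * ln(B / A)."""
    return A * math.log(B / A)

def rhs9(A, B, n):
    """Right-hand side A * ln(B / n) + n / e."""
    return A * math.log(B / n) + n / math.e

# Check the inequality on a grid and the equality case A = n / e.
n = 40
grid_ok = all(
    lhs9(A, B) <= rhs9(A, B, n) + 1e-9
    for A in (0.1, 1.0, 5.0, 40.0, 300.0)
    for B in (0.5, 10.0, 1000.0)
)
tight_gap = rhs9(n / math.e, 10.0, n) - lhs9(n / math.e, 10.0)
```

The slack $n/e$ is what produces the additive $2/\sqrt{n}$ term in Corollary 3 after multiplying by $2^{3/2}/n$ and taking square roots.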

4 Extension to the $\ell_q(\mathcal{M})$ Case

There is a rather obvious extension of our framework, which should be
mentioned for completeness: Let $q$ and $p$ be conjugate exponents (i.e. $1/q + 1/p = 1$)
and define

\[ \|\beta\|_{\mathcal{M}_q} = \inf\Big\{\Big(\sum_{M\in\mathcal{M}}\|v_M\|^q\Big)^{1/q} \;:\; v_M \in H \text{ and } \sum_{M\in\mathcal{M}} Mv_M = \beta\Big\} \]

in analogy to (1). Then $\|\beta\|_{\mathcal{M}_q}$ is a norm and the dual norm is given by

\[ \|z\|_{\mathcal{M}_q*} = \Big(\sum_{M\in\mathcal{M}}\|Mz\|^p\Big)^{1/p}. \]

The proof of these facts is omitted in this version of the paper. In the following
we give a result, which can be applied to cases analogous to those in Section 2,
where it recovers existing results up to constant multiplicative factors.

Theorem 12 Let $\mathbf{x}$ be a sample and $\mathcal{R}_{\mathcal{M}_q}(\mathbf{x})$ the empirical Rademacher
complexity of the class of linear functions parameterized by $\beta$ with $\|\beta\|_{\mathcal{M}_q} \le 1$. Then
for $1 < q \le 2$

\[ \mathcal{R}_{\mathcal{M}_q}(\mathbf{x}) \le \frac{2^{3/2}}{n}\sqrt{\pi p\sum_{i=1}^n \|x_i\|_{\mathcal{M}_q*}^2}. \]

The proof is analogous to the proof of Theorem 2 but somewhat more
straightforward.
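Lemma 13 Let $\mathcal{M}$ be a finite or countably infinite set of linear transformations
$M : \mathbb{R}^n \to H$ and $\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_n)$ a vector of orthonormal random variables
(satisfying $\mathbb{E}\epsilon_i\epsilon_j = \delta_{ij}$) with values in $[-1,1]$. Then for $p \ge 2$

\[ \mathbb{E}\Big(\sum_{M\in\mathcal{M}}\|M\boldsymbol{\epsilon}\|^p\Big)^{1/p} \le \sqrt{2\pi p}\,\Big(\sum_{M\in\mathcal{M}}\|M\|_{HS}^p\Big)^{1/p}. \]

Proof. First note that by standard results on the absolute moments of the
normal distribution

\[ \int_0^\infty t^{p-1}\exp\Big(\frac{-t^2}{2}\Big)dt \le \sqrt{\frac{\pi}{2}}\,(p-2)!! \le \sqrt{\frac{\pi}{2}}\,\big(1\cdot 3\cdot\ldots\cdot(p-2)\big) \le \sqrt{\frac{\pi}{2}}\,p^{p/2-1}, \]

so

\[ \Big(p\int_0^\infty t^{p-1}\exp\Big(\frac{-t^2}{2}\Big)dt\Big)^{1/p} \le \Big(\sqrt{\frac{\pi}{2}}\,p^{p/2}\Big)^{1/p} \le \sqrt{\frac{\pi p}{2}}. \tag{10} \]

Jensen's inequality and integration by parts give

\[ \mathbb{E}\Big(\sum_{M\in\mathcal{M}}\|M\boldsymbol{\epsilon}\|^p\Big)^{1/p} \le \Big(\sum_{M\in\mathcal{M}} p\int_0^\infty \Pr\{\|M\boldsymbol{\epsilon}\| > t\}\,t^{p-1}dt\Big)^{1/p} \]
\[ \le \Big(2p\sum_{M\in\mathcal{M}}\int_0^\infty t^{p-1}\exp\Big(\frac{-t^2}{4\|M\|_{HS}^2}\Big)dt\Big)^{1/p}, \]

where Lemma 10 (ii) was used in the last step with $r = 2$. A change of variables
$t \to t/(\sqrt{2}\|M\|_{HS})$ gives

\[ \mathbb{E}\Big(\sum_{M\in\mathcal{M}}\|M\boldsymbol{\epsilon}\|^p\Big)^{1/p} \le \Big(2p\int_0^\infty t^{p-1}\exp\Big(\frac{-t^2}{2}\Big)dt\Big)^{1/p}\sqrt{2}\,\Big(\sum_{M\in\mathcal{M}}\|M\|_{HS}^p\Big)^{1/p} \]
\[ \le \sqrt{2\pi p}\,\Big(\sum_{M\in\mathcal{M}}\|M\|_{HS}^p\Big)^{1/p}, \]

where we use (10) in the last inequality. ■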



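Proof of Theorem 12 As in the proof of Theorem 2 we proceed using duality
and apply Lemma 13 to the set of transformations $\mathcal{M}_{\mathbf{x}} = \{M_{\mathbf{x}} : M \in \mathcal{M}\}$.

\[ \mathcal{R}_{\mathcal{M}_q}(\mathbf{x}) \le \frac{2}{n}\,\mathbb{E}\Big\|\sum_{i=1}^n \epsilon_i x_i\Big\|_{\mathcal{M}_q*} = \frac{2}{n}\,\mathbb{E}\Big(\sum_{M\in\mathcal{M}}\|M_{\mathbf{x}}\boldsymbol{\epsilon}\|^p\Big)^{1/p} \]
\[ \le \frac{2}{n}\sqrt{2\pi p}\,\Big(\sum_{M\in\mathcal{M}}\|M_{\mathbf{x}}\|_{HS}^p\Big)^{1/p} = \frac{2}{n}\sqrt{2\pi p}\,\Big(\sum_{M\in\mathcal{M}}\Big(\sum_{i=1}^n\|Mx_i\|^2\Big)^{p/2}\Big)^{1/p} \]
\[ \le \frac{2}{n}\sqrt{2\pi p\sum_{i=1}^n\Big(\sum_{M\in\mathcal{M}}\|Mx_i\|^p\Big)^{2/p}} = \frac{2^{3/2}}{n}\sqrt{\pi p\sum_{i=1}^n \|x_i\|_{\mathcal{M}_q*}^2}, \]

where the last inequality is just the triangle inequality in $\ell_{p/2}$. ■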

5 Conclusion and Future Work 

We have presented a bound on the Rademacher average for linear function 
classes described by infimum convolution norms which are associated with a class 
of bounded linear operators on a Hilbert space. We highlighted the generality 
of the approach and its dimension independent features. 

When the bound is applied to specific cases ($\ell_2$, $\ell_1$, mixed $\ell_1/\ell_2$ norms) it
recovers existing bounds (up to small changes in the constants). The bound
is however more general and allows for the possibility to remove the "log d"
factor which appears in previous bounds. Specifically, we have shown that the
bound can be applied in infinite dimensional settings, provided that the moment
condition (3) is satisfied. We have also applied the bound to multiple kernel
learning. While in the standard case the bound is only slightly worse in the
constants, the bound is potentially smaller and applies to the more general case
in which there is a countable set of kernels, provided the expectation of the sum
of the kernels is bounded.

An interesting question is whether the bound presented is tight. As noted in
[3] the "log d" is unavoidable. This result immediately implies that our bound
is also tight, since we may choose $R^2 = d$ in equation (4).

A potential future direction of research is the application of our results in the 
context of sparsity oracle inequalities. In particular, it would be interesting to 
modify the analysis in [H], in order to derive dimension independent bounds. 
Another interesting scenario is the combination of our analysis with metric 
entropy. 